Document indexing: a concept-based approach to term weight estimation

نویسندگان

  • Bo-Yeong Kang
  • Sang-Jo Lee
چکیده

Traditional index weighting approaches for information retrieval from texts depend on the term frequency based analysis of the text contents. A shortcoming of these indexing schemes, which consider only the occurrences of the terms in a document, is that they have some limitations in extracting semantically exact indexes that represent the semantic content of a document. To address this issue, we developed a new indexing formalism that considers not only the terms in a document, but also the concepts. In this approach, concept clusters are defined and a concept vector space model is proposed to represent the semantic importance degrees of lexical items and concepts within a document. Through an experiment on the TREC collection of Wall Street Journal documents, we show that the proposed method outperforms an indexing method based on term frequency (TF), especially in regard to the few highest-ranked documents. Moreover, the index term dimension was 80% lower for the proposed method than for the TF-based method, which is expected to significantly reduce the document search time in a real environment. 2004 Elsevier Ltd. All rights reserved.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Document Weighted Approach for Gender and Age Prediction Based on Term Weight Measure

Author profiling is a text classification technique, which is used to predict the profiles of unknown text by analyzing their writing styles. Author profiles are the characteristics of the authors like gender, age, nativity language, country and educational background. The existing approaches for Author Profiling suffered from problems like high dimensionality of features and fail to capture th...

متن کامل

Correlation Preserved Indexing Based Approach For Document Clustering

Document clustering is the act of collecting similar documents into clusters, where similarity is some function on a document. Document clustering method achieves 1) a high accuracy for documents 2) document frequency can be calculated 3) term weight is calculated with the term frequency vector. Document clustering is closely related to the concept of data clustering. Document clustering is a m...

متن کامل

A Novel Approach to Semantic Indexing Based on Concept

This paper suggests the efficient indexing method based on a concept vector space that is capable of representing the semantic content of a document. The two information measure, namely the information quantity and the information ratio, are defined to represent the degree of the semantic importance within a document. The proposed method is expected to compensate the limitations of term frequen...

متن کامل

Text Rank: A Novel Concept for Extraction Based Text Summarization

Indexing used in text summarization has been an active area of current researches. Text summarization plays a crucial role in information retrieval. Snippets generated by web search engines for each query result is an application of text summarization. Existing text summarization techniques shows that the indexing is done on the basis of the words in the document and consists of an array of the...

متن کامل

Hybrid Clustering Approach for Concept Generation

Information retrieval is one of the major research areas due to accumulation of huge information in digital form. Various techniques of Information retrieval are based on the fact that various terms present in a document along with their frequency of occurrence signify the semantics of the document. Recent attempts to find the relevant document for a context represents documents in a Latent Sem...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Inf. Process. Manage.

دوره 41  شماره 

صفحات  -

تاریخ انتشار 2005